39 research outputs found

An analysis of local and global solutions to address Big Data imbalanced classification: a case study with SMOTE preprocessing

Author: A Fernandez
CLP Chen
D García-Gil
GEAPA Batista
H Karau
J Huang
J Maillo
MJ Basgall
NV Chawla
PD Gutierrez
R Barandela
RC Prati
S Ramírez-Gallego
T White
V López
X Meng
Publication date: 03/09/2019

Handling the huge volume of data that is continuously generated is an important challenge in the Machine Learning field. Traditional techniques must be adapted, or new ones created, and distributed technologies are needed to cope with the significant scalability constraints of the Big Data context. In many Big Data classification applications, some classes are highly underrepresented, which leads to what is known as the imbalanced classification problem. In this scenario, learning algorithms are often biased towards the majority classes and treat minority ones as outliers or noise. Consequently, preprocessing techniques that balance the class distribution were developed. Balance can be achieved by removing majority instances (undersampling) or by creating minority examples (oversampling). Among the oversampling methods, one of the most widespread is the SMOTE algorithm, which creates artificial examples according to the neighborhood of each minority class instance. In this work, our objective is to analyze the behavior of SMOTE in Big Data as a function of key aspects such as the oversampling degree, the neighborhood value and, especially, the type of distributed design (local vs. global).
Instituto de Investigación en Informátic
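The oversampling idea described in this abstract can be illustrated with a small single-machine sketch. It is not the distributed (local or global) design studied in the paper; it only shows the basic SMOTE step of interpolating between a minority instance and one of its nearest minority neighbours. The function name `smote_sketch` and its parameters `n_new`, `k` and `seed` are illustrative assumptions, not names taken from the paper.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority examples (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority instances")
    k = min(k, n - 1)  # a point cannot be its own neighbour

    # Pairwise Euclidean distances between minority instances.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # exclude self-matches

    # Indices of the k nearest minority neighbours of each instance.
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base instance and one of its neighbours for each new example.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position on the segment base -> neighbour.
    gap = rng.random((n_new, 1))
    return minority[base] + gap * (minority[chosen] - minority[base])

# Hypothetical toy data: four minority points on the unit square.
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(points, n_new=10, k=2)
```

Here `n_new` plays the role of the oversampling degree and `k` the neighborhood value mentioned in the abstract. In a distributed setting, a "local" design would presumably run such a routine independently on each data partition, whereas a "global" design would search for neighbours across all partitions; that reading is an assumption based on the abstract's wording.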

Crossref

Servicio de Difusión de la Creación Intelectual

Classifying protein-protein interaction articles using word and syntactic features

Author: A Ceol
B Aranda
B Settles
C Blaschke
D Rebholz-Schuhmann
E Buyko
GD Bader
GEAPA Batista
H Jang
HJ Lowe
J Björne
JR Curran
K Sugiyama
L Salwinski
L Tanabe
LH Smith
M Huang
M Krallinger
M Kubat
MF Porter
P Baldi
RK Ando
S Kim
S Nash
Sun Kim
T Mitsumori
T Zhang
VN Vapnik
W John Wilbur
Y Miyao
Y Niu
Publication venue: BioMed Central
Publication date: 01/01/2011

Crossref

Springer - Publisher Connector

PubMed Central

An insight into imbalanced Big Data classification: outcomes and challenges

Author: A Fernández
A Thusoo
B Krawczyk
C Bunkhumpornpat
CP Chen
D Lyubimov
E Elsebakhi
E Ramentol
F Hu
G Haixiang
GEAPA Batista
GM Weiss
H He
H Yu
I Triguero
J Alcalá-Fdez
J Dean
J Huang
J Li
JA Sáez
JM Tomczak
K Kambatla
L Rokach
M Galar
M Wasikowski
NV Chawla
PC Zikopoulos
R Baeza-Yates
R Barandela
R Blagus
RC Prati
S Alshomrani
S Barua
S Elhag
S Kamal
S Owen
S Río
S-H Park
T Jo
T White
V García
V López
X Meng
X Wu
Y Guo
Y Sun
Y-S Chen
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the considerable advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. Because this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The main reasons are the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper rests on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we discuss the challenges and future directions for the topic.
This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Repositorio Institucional Universidad de Granada

On the relevance of preprocessing in predictive maintenance for dynamic systems

Author: A Chuang
A Graves
A Savitzky
AJ Smola
AP Bradley
B Schölkopf
BS Yang
BW Silverman
C Cernuda
C Phua
C Wang
Carlos Cernuda
CE Shannon
D Cabrera
D Freedman
D Li
D Lin
D Wolpert
D Wu
DB Rubin
DL Wilson
E Lughofer
F Fleuret
F Serdio
G Brown
G Qiu
G Weiss
GEAPA Batista
GEP Box
H Peng
H Yang
H Zou
HB Mann
HJ Weaver
I Daubechies
I Guyon
I Jolliffe
I Tomek
J Gerretzen
J Ville
JB Tenenbaum
Jorma Laurikkala
K Greff
K Tschumitschew
K Varmuza
KV Branden
L Breiman
L Maaten
L Tan
L Zhang
M Bartlett
M Frigo
M Hubert
M Jung
M Li
MA Oliveira
MR Smith
N Friedman
N Kwak
NE Huang
NV Chawla
O Troyanskaya
P Duhamel
P Mahalanobis
P Welch
PE Hart
R Battiti
R Kohavi
R Nikzad-Langerodi
R Nunkesser
R Tibshirani
RC Sharpley
RD Maesschalck
RM Sakia
RN Bracewell
S García
S Gelper
S Hochreiter
S Kadambe
S Oba
S Roweis
SA Dudani
SE Said
SG Mallat
Sudipto Guha
T Benkedjouh
T Hastie
T Hofmann
T Jo
T Loutas
TY Wu
V Vapnik
W Pedrycz
Y Saeys
Publication date: 01/01/2018

The complexity involved in real-time, data-driven monitoring of dynamic systems for predictive maintenance is usually huge. To a greater or lesser extent, any data-driven approach is sensitive to data preprocessing, understood as any data treatment prior to the application of the monitoring model, and preprocessing is sometimes crucial for the final development of the employed monitoring technique. The aim of this work is to quantify exhaustively the sensitivity of data-driven predictive maintenance models in dynamic systems. We consider a couple of predictive maintenance scenarios, each of them defined by publicly available data. For each scenario, we consider its properties and apply several techniques for each of the successive preprocessing steps, e.g. data cleaning, missing values treatment, outlier detection, feature selection, or imbalance compensation. The pretreatment configurations, i.e. sequential combinations of techniques from different preprocessing steps, are considered together with different monitoring approaches, in order to determine the relevance of data preprocessing for predictive maintenance in dynamic systems.

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

BCAM's Institutional Repository Data

Monthly variation in the probability of presence of adult Culicoides populations in nine European countries and the implications for targeted surveillance

Author: A Afonso
A Guisan
A Ibañez-Justicia
A Liaw
AC Cuéllar
AJ Tatem
Alexander Mathis
Ana Carolina Cuéllar
Anders Lindström
Anders Stockmarr
Andreas Baum
Anna Orłowska
ARW Elbers
B Hoffmann
B Pinior
B Purse
Bethsabée Scheid
BL Random Forests
Bruno Mathieu
BV Purse
C Calvete
C Kaufmann
C Liu
Carlos Barceló
Claire Garros
D Cianci
David Chavernac
Delphine Delécolle
DR Cutler
DW Ramilo
E Dijkstra
E Ducheyne
E Kiel
E Thiry
EDENext
EFSA
EJ Wittmann
Ellen Kiel
European Commission
Franz J. Conraths
Franz Rubel
GEAPA Batista
H Guis
H Mehlhorn
Henrik Skovgard
Ignace Rakotoarivony
Inger Hamnes
J Pearce
J Peters
J Rushton
J-F Toussaint
Jan Chirico
Javier Lucientes
Jean-Claude Delécolle
JN Mandrekar
Jonathan Lhoir
JPW Scharlemann
Jörn Gethmann
K Brugger
Katharina Brugger
KR Searle
L Breiman
Lene Jung Kjær
M Ander
M Kuhn
Magdalena Larska
Marcin Smreczak
Marie-Laure Setier-Rio
Mats Gunnar Andersson
MD Ortega
Miguel Ángel Miranda Chueca
N Haider
N Hartemink
N Lunardon
N Selemetas
P-H Clausen
Petter Hopp
PS Mellor
R Core Team
R Lühken
R Meiswinkel
R Venail
Renke Lühken
René Bødker
RJ Hijmans
RM Du Toit
Roger Venail
Rosa Estrada
S Caracappa
S Carpenter
S Kalluri
S Steinke
S Zientara
SA Nielsen
SI Hay
Sonja Steinke
Ståle Sviland
Søren Achim Nielsen
T Fawcett
Thomas Balenghien
TP Robinson
Wesley Tack
Xavier Allène
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2018

Background: Biting midges of the genus Culicoides (Diptera: Ceratopogonidae) are small hematophagous insects responsible for the transmission of bluetongue virus, Schmallenberg virus and African horse sickness virus to wild and domestic ruminants and equids. Outbreaks of these viruses have caused economic damage within the European Union. The spatio-temporal distribution of biting midges is a key factor in identifying areas with the potential for disease spread. The aim of this study was to identify and map areas of negligible adult activity for each month in an average year. Average monthly risk maps can be used as a tool when allocating resources for surveillance and control programs within Europe. Methods: We modelled the occurrence of C. imicola and the Obsoletus and Pulicaris ensembles using existing entomological surveillance data from Spain, France, Germany, Switzerland, Austria, Denmark, Sweden, Norway and Poland. The monthly probability of each vector species and ensemble being present in Europe, based on climatic and environmental input variables, was estimated with the machine learning technique Random Forest. Subsequently, the monthly probability was classified into three classes: Absence, Presence and Uncertain status. These three classes are useful for mapping areas of no risk, areas of high risk targeted for animal movement restrictions, and areas with an uncertain status that need active entomological surveillance to determine whether or not vectors are present. Results: The distribution of the Culicoides species ensembles was in agreement with their previously reported distribution in Europe. The Random Forest models were very accurate in predicting the probability of presence for C. imicola (mean AUC = 0.95), less accurate for the Obsoletus ensemble (mean AUC = 0.84), while the lowest accuracy was found for the Pulicaris ensemble (mean AUC = 0.71). The most important environmental variables in the models were related to temperature and precipitation for all three groups. Conclusions: The duration of periods with low or null adult activity can be derived from the associated monthly distribution maps, and it was also possible to identify and map areas with uncertain predictions. In the absence of ongoing vector surveillance, these maps can be used by veterinary authorities to classify areas as likely vector-free or as likely risk areas from southern Spain to northern Sweden with acceptable precision. The maps can also focus costly entomological surveillance on seasons and areas where the predictions and vector-free status remain uncertain.
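The three-class step described in this abstract can be pictured with a minimal sketch that maps a predicted probability of presence to Absence, Uncertain or Presence. The abstract does not state how the class boundaries were chosen, so the function name `classify_presence` and the cut-offs `lower=0.3` and `upper=0.7` are purely hypothetical placeholders, not values from the study.

```python
def classify_presence(prob, lower=0.3, upper=0.7):
    """Map a probability of presence to one of three status classes.

    The cut-offs are hypothetical; the study's actual thresholds are not
    given in the abstract.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    if prob < lower:
        return "Absence"      # likely vector-free
    if prob > upper:
        return "Presence"     # likely risk area
    return "Uncertain"        # candidate for active surveillance

# Hypothetical monthly probabilities for a single location.
monthly = [0.05, 0.10, 0.35, 0.60, 0.85, 0.90]
statuses = [classify_presence(p) for p in monthly]
```

Widening the gap between the two cut-offs enlarges the Uncertain class, which corresponds to the areas the abstract proposes for targeted entomological surveillance.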

Crossref

Roskilde Universitet

Repositorio Universidad de Zaragoza

Directory of Open Access Journals

Repositori Institucional de la UIB

Agritrop

ZORA

Online Research Database In Technology